Project: Ensemble Techniques

Problem Statement: Term Deposit Sale

Goal:

Using the data collected from existing customers, build a model that will help the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit, and thus increase their hit ratio.

Attribute information

Input variables:

Bank client data:

  1. age: Continuous feature
  2. job: Type of job (management, technician, entrepreneur, blue-collar, etc.)
  3. marital: marital status (married, single, divorced)
  4. education: education level (primary, secondary, tertiary)
  5. default: has credit in default?
  6. housing: has housing loan?
  7. loan: has personal loan?
  8. balance: balance in account
  9. contact: contact communication type
  10. month: last contact month of year
  11. day: last contact day of the month
  12. duration: last contact duration, in seconds*

Other attributes:

  1. campaign: number of contacts performed during this campaign and for this client.
  2. pdays: number of days that passed after the client was last contacted in a previous campaign (-1 means the client was not previously contacted, or the last contact was more than 900 days ago).
  3. previous: number of contacts performed before this campaign and for this client.
  4. poutcome: outcome of the previous marketing campaign
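The -1 sentinel in pdays is easy to mishandle in later arithmetic. A minimal sketch of turning it into an explicit flag (the column name `was_contacted` is our own, not part of the dataset):

```python
import pandas as pd

# Hypothetical mini-frame illustrating the pdays sentinel handling
df = pd.DataFrame({"pdays": [-1, 10, -1, 184]})

# Flag whether the client was previously contacted (-1 means "never contacted")
df["was_contacted"] = (df["pdays"] != -1).astype(int)
print(df["was_contacted"].tolist())  # -> [0, 1, 0, 1]
```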

Output variable (desired target):

  1. Target: has the client subscribed to a term deposit? (yes, no)

Import Libraries

In [24]:
import warnings
warnings.filterwarnings('ignore')
In [25]:
import pandas as pd
import numpy as np

from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer  #DT does not take strings as input for the model fit step....
from IPython.display import Image  
#import pydotplus as pydot
from sklearn import tree
from os import system
In [54]:
#plt.style.use('ggplot')
pd.options.display.float_format = '{:,.2f}'.format
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
In [27]:
# Below we will read the data from the local folder
df = pd.read_csv("bank-full.csv")

# Now display the header 
print ('Bank-Full.csv data set:')
df.head(10)
Bank-Full.csv data set:
Out[27]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
7 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
In [117]:
df.tail() ## to see what the end of the data looks like
Out[117]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
45206 51 technician married tertiary no 825 no no cellular 17 nov 977 3 -1 0 unknown yes
45207 71 retired divorced primary no 1729 no no cellular 17 nov 456 2 -1 0 unknown yes
45208 72 retired married secondary no 5715 no no cellular 17 nov 1127 5 184 3 success yes
45209 57 blue-collar married secondary no 668 no no telephone 17 nov 508 4 -1 0 unknown no
45210 37 entrepreneur married secondary no 2971 no no cellular 17 nov 361 2 188 11 other no
  • ### 1.1 Univariate analysis
In [163]:
df.info() # here we will see the number of entries (rows and columns), dtype, and non-null count
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  Target     45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB

Insight:

  • There are no null values, i.e., we have a value for every row and column. There are 10 variables that need to be changed from object to categorical in the coming steps.
In [28]:
df.shape # size of the data set also shown in the cell above
Out[28]:
(45211, 17)
In [122]:
neg_exp=df[df.pdays.lt(0)] # this is to see the number of negative values present 
print (" the number of negative entries is",len(neg_exp.index)) 
# this output might be taken into consideration later in the calculations.
 the number of negative entries is 36954
In [154]:
df.describe().transpose() # transpose makes the attributes easier to read
Out[154]:
count mean std min 25% 50% 75% max
age 45,211.00 40.94 10.62 18.00 33.00 39.00 48.00 95.00
balance 45,211.00 1,362.27 3,044.77 -8,019.00 72.00 448.00 1,428.00 102,127.00
day 45,211.00 15.81 8.32 1.00 8.00 16.00 21.00 31.00
duration 45,211.00 258.16 257.53 0.00 103.00 180.00 319.00 4,918.00
campaign 45,211.00 2.76 3.10 1.00 1.00 2.00 3.00 63.00
pdays 45,211.00 40.20 100.13 -1.00 -1.00 -1.00 -1.00 871.00
previous 45,211.00 0.58 2.30 0.00 0.00 0.00 0.00 275.00
In [37]:
df.nunique() # number of unique values in each column
# this helps to identify categorical variables.
Out[37]:
age            77
job            12
marital         3
education       4
default         2
balance      7168
housing         2
loan            2
contact         3
day            31
month          12
duration     1573
campaign       48
pdays         559
previous       41
poutcome        4
Target          2
dtype: int64

Insights:

  • Based on the results above, we expect at least 6 categorical variables: those with only 2 to 4 distinct entries. The variable 'job' is categorical since it holds job types, as is 'month', since these are discrete inputs. The categorical variables are: ['job','marital', 'education','default','housing','loan','contact','month','poutcome','Target']
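The rule of thumb above can be automated with a cardinality threshold. A minimal sketch on a toy frame (the threshold and column values are illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in frame; column names mirror the dataset, values are made up
df = pd.DataFrame({
    "job": ["admin", "technician", "admin", "services"],
    "default": ["no", "yes", "no", "no"],
    "balance": [100, 250, 900, 431],
})

# Treat object columns, or columns with few unique values, as categorical
threshold = 3
categorical = [c for c in df.columns
               if df[c].dtype == object or df[c].nunique() <= threshold]
print(categorical)  # -> ['job', 'default']
```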
In [46]:
 # Now we will get a list of unique values to evaluate how to arrange the data set
for a in list(df.columns):
    n = df[a].unique()
    
    # if number of unique values is less than 30, print the values. Otherwise print the number of unique values
    if len(n)<30:
        print(a + ': ')
        print(df[a].value_counts(normalize=True))
        print()
    else:
        print(a + ': ' +str(len(n)) + ' unique values')
        print()
age: 77 unique values

job: 
blue-collar     0.22
management      0.21
technician      0.17
admin.          0.11
services        0.09
retired         0.05
self-employed   0.03
entrepreneur    0.03
unemployed      0.03
housemaid       0.03
student         0.02
unknown         0.01
Name: job, dtype: float64

marital: 
married    0.60
single     0.28
divorced   0.12
Name: marital, dtype: float64

education: 
secondary   0.51
tertiary    0.29
primary     0.15
unknown     0.04
Name: education, dtype: float64

default: 
no    0.98
yes   0.02
Name: default, dtype: float64

balance: 7168 unique values

housing: 
yes   0.56
no    0.44
Name: housing, dtype: float64

loan: 
no    0.84
yes   0.16
Name: loan, dtype: float64

contact: 
cellular    0.65
unknown     0.29
telephone   0.06
Name: contact, dtype: float64

day: 31 unique values

month: 
may   0.30
jul   0.15
aug   0.14
jun   0.12
nov   0.09
apr   0.06
feb   0.06
jan   0.03
oct   0.02
sep   0.01
mar   0.01
dec   0.00
Name: month, dtype: float64

duration: 1573 unique values

campaign: 48 unique values

pdays: 559 unique values

previous: 41 unique values

poutcome: 
unknown   0.82
failure   0.11
other     0.04
success   0.03
Name: poutcome, dtype: float64

Target: 
no    0.88
yes   0.12
Name: Target, dtype: float64

Insights:

  • 82% of the column poutcome is unknown, so this column does not seem to add value to the calculations. The success category is only 3%.
  • Most of the calls were made in May and there were no calls in December. This could be a key factor to consider for a future campaign; we will see at the end of the project.
  • The main contact type is cellular, with 65%.
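If the poutcome column is kept despite its dominant "unknown" level, one option (not applied in this notebook) is to collapse the rare levels into a single bin. A sketch on a toy series that mirrors the 82/11/4/3 split:

```python
import pandas as pd

# Toy series reproducing poutcome's observed proportions
s = pd.Series(["unknown"] * 82 + ["failure"] * 11 + ["other"] * 4 + ["success"] * 3)
freq = s.value_counts(normalize=True)

# Collapse any level under 5% into an "other" bin
rare = freq[freq < 0.05].index
collapsed = s.where(~s.isin(rare), "other")
print(collapsed.value_counts(normalize=True).round(2).to_dict())
```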
In [49]:
for feature in df.columns: # Loop through all columns in the dataframe
    if df[feature].dtype == 'object': # Only apply for columns with categorical strings
        df[feature] = pd.Categorical(df[feature]) # convert object columns to the categorical dtype

print("This is the new Dtype for the dataset")
print()
print (df.dtypes)
df.head(10)
This is the new Dtype for the dataset

age             int64
job          category
marital      category
education    category
default      category
balance         int64
housing      category
loan         category
contact      category
day             int64
month        category
duration        int64
campaign        int64
pdays           int64
previous        int64
poutcome     category
Target       category
dtype: object
Out[49]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
5 35 management married tertiary no 231 yes no unknown 5 may 139 1 -1 0 unknown no
6 28 management single tertiary no 447 yes yes unknown 5 may 217 1 -1 0 unknown no
7 42 entrepreneur divorced tertiary yes 2 yes no unknown 5 may 380 1 -1 0 unknown no
8 58 retired married primary no 121 yes no unknown 5 may 50 1 -1 0 unknown no
9 43 technician single secondary no 593 yes no unknown 5 may 55 1 -1 0 unknown no
In [8]:
df.boxplot(column="pdays",return_type='axes',figsize=(8,8))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b460b461f0>
In [10]:
df.boxplot(column="campaign",return_type='axes',figsize=(8,8))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b461db3d00>
In [12]:
df.boxplot(column="balance",return_type='axes',figsize=(8,8))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b461e5e880>
In [11]:
df.boxplot(column="duration",return_type='axes',figsize=(8,8))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b461e10d30>

Insight:

  • The boxplots above are not giving much information. The data seems to be continuous, covering a wide range of values.
  • The pdays column appears continuous, as seen in the boxplot, and may not add any value to the calculation.
  • These boxplots will be evaluated with histograms below.
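Skewness can also be quantified numerically before turning to the histograms. A toy illustration (on the real frame one would call `df[numeric_cols].skew()`):

```python
import pandas as pd

# A small series with one large outlier, mimicking the long right tails
# seen in balance, duration, campaign, etc.
s = pd.Series([1, 1, 1, 2, 2, 3, 50])

# Sample skewness: positive values indicate a long right tail
print(s.skew() > 0)
```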
In [55]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
sns.boxplot(x = 'age', data = df, orient = 'v', ax = ax1)
ax1.set_xlabel('People Age', fontsize=15)
ax1.set_ylabel('Age', fontsize=15)
ax1.set_title('Age Distribution', fontsize=15)
ax1.tick_params(labelsize=15)

sns.distplot(df['age'], ax = ax2)
sns.despine(ax = ax2)
ax2.set_xlabel('Age', fontsize=15)
ax2.set_ylabel('Occurrence', fontsize=15)
ax2.set_title('Age x Occurrence', fontsize=15)
ax2.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
In [57]:
 # For the moment we can identify the attributes below that have more than 2 distinct entries or continuous values
no_categ_df= ['age', 'balance','day', 'duration', 'campaign', 'pdays','previous']
df[no_categ_df].hist(stacked=False, bins=50, figsize=(30,30), layout=(4,2)); # Histogram of the tentative continous variable

#*** Please note that some of these variables may be changed to categorical inputs or dropped after the graphical evaluation.

Insights:

  • From the graphics above it is possible to see that campaign (number of contacts) is mostly below 10, while the mean seen in the description above is about 2.76.
  • All the plots, except age and day, are strongly positively skewed, with the median greater than the mode.
In [16]:
df.columns # this line is to get the name of the column to be used below
Out[16]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'Target'],
      dtype='object')
In [71]:
Categ_df = ['job','marital', 'education','default','housing','loan','contact','month','poutcome','Target']
for i in Categ_df :   # checking value counts of the list "Categ_df"
    print(df[i].value_counts(normalize=True))
    print()
blue-collar     0.22
management      0.21
technician      0.17
admin.          0.11
services        0.09
retired         0.05
self-employed   0.03
entrepreneur    0.03
unemployed      0.03
housemaid       0.03
student         0.02
unknown         0.01
Name: job, dtype: float64

married    0.60
single     0.28
divorced   0.12
Name: marital, dtype: float64

secondary   0.51
tertiary    0.29
primary     0.15
unknown     0.04
Name: education, dtype: float64

no    0.98
yes   0.02
Name: default, dtype: float64

yes   0.56
no    0.44
Name: housing, dtype: float64

no    0.84
yes   0.16
Name: loan, dtype: float64

cellular    0.65
unknown     0.29
telephone   0.06
Name: contact, dtype: float64

may   0.30
jul   0.15
aug   0.14
jun   0.12
nov   0.09
apr   0.06
feb   0.06
jan   0.03
oct   0.02
sep   0.01
mar   0.01
dec   0.00
Name: month, dtype: float64

unknown   0.82
failure   0.11
other     0.04
success   0.03
Name: poutcome, dtype: float64

no    0.88
yes   0.12
Name: Target, dtype: float64

We will proceed to plot the categorical values for better visualization and decide whether to replace values (yes=1 / no=0) or apply one-hot encoding.

In [64]:
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'education', data = df[Categ_df])
ax.set_xlabel('Education Received', fontsize=16)
ax.set_ylabel('Count', fontsize=16)
ax.set_title('Education', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
In [66]:
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'marital', data = df[Categ_df])
ax.set_xlabel('Marital Status', fontsize=16)
ax.set_ylabel('Count', fontsize=16)
ax.set_title('Marital', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
In [67]:
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'job', data = df[Categ_df])
ax.set_xlabel('Types of Jobs', fontsize=16)
ax.set_ylabel('Number', fontsize=16)
ax.set_title('Job', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
In [68]:
fig, ax = plt.subplots()
fig.set_size_inches(25, 8)
sns.countplot(x = 'poutcome', data = df[Categ_df])
ax.set_xlabel('Previous Marketing Campaign Outcome', fontsize=16)
ax.set_ylabel('Number of Previous Outcomes', fontsize=16)
ax.set_title('poutcome (Previous Marketing Campaign Outcome)', fontsize=16)
ax.tick_params(labelsize=16)
sns.despine()
In [62]:
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (20,8))
sns.countplot(x = 'default', data = df[Categ_df], ax = ax1, order = ['no', 'yes'])
ax1.set_title('Default', fontsize=15)
ax1.set_xlabel('')
ax1.set_ylabel('Count', fontsize=15)
ax1.tick_params(labelsize=15)

sns.countplot(x = 'housing', data = df[Categ_df], ax = ax2, order = ['no', 'yes'])
ax2.set_title('Housing', fontsize=15)
ax2.set_xlabel('')
ax2.set_ylabel('Count', fontsize=15)
ax2.tick_params(labelsize=15)

sns.countplot(x = 'loan', data = df[Categ_df], ax = ax3, order = ['no', 'yes'])
ax3.set_title('Loan', fontsize=15)
ax3.set_xlabel('')
ax3.set_ylabel('Count', fontsize=15)
ax3.tick_params(labelsize=15)

plt.subplots_adjust(wspace=0.25)
In [70]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (15,6))
sns.countplot(df[Categ_df]['contact'], ax = ax1)
ax1.set_xlabel('Contact', fontsize = 10)
ax1.set_ylabel('Count', fontsize = 10)
ax1.set_title('Contact Counts')
ax1.tick_params(labelsize=10)

sns.countplot(df[Categ_df]['month'], ax = ax2, order = ['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
ax2.set_xlabel('Months', fontsize = 10)
ax2.set_ylabel('')
ax2.set_title('Months Counts')
ax2.tick_params(labelsize=10)

plt.subplots_adjust(wspace=0.25)
In [ ]:
#Based on the plots above, one-hot encoding will be applied to the the categorical variables
##with more than 2 classifications.
##THIS IS PART OF THE QUESTION 2: PREPARING THE DATA FOR THE MODEL
replaceStruct = {
                "default":{"no": 0, "yes": 1 },
                "housing":{"no": 0, "yes": 1 },
                "loan":  {"no": 0, "yes": 1  },
                "Target":{"no":0, "yes":1},
                    } # All boolean columns will be changed to 1 and 0
oneHotCols=["marital","education","contact","poutcome","job","month"]
In [73]:
df=df.replace(replaceStruct)
df=pd.get_dummies(df, columns=oneHotCols)
df.head(10)
Out[73]:
age default balance housing loan day duration campaign pdays previous ... month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep
0 58 0 2143 1 0 5 261 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
1 44 0 29 1 0 5 151 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
2 33 0 2 1 1 5 76 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
3 47 0 1506 1 0 5 92 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
4 33 0 1 0 0 5 198 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
5 35 0 231 1 0 5 139 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
6 28 0 447 1 1 5 217 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
7 42 1 2 1 0 5 380 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
8 58 0 121 1 0 5 50 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
9 43 0 593 1 0 5 55 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0

10 rows × 49 columns
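A side note on the encoding above: for linear models, `pd.get_dummies` with `drop_first=True` drops one level per variable and avoids the redundant dummy column (the dummy-variable trap). A sketch on a toy column; the notebook itself keeps all levels:

```python
import pandas as pd

# Toy single-column frame for comparing the two encodings
df = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

full = pd.get_dummies(df, columns=["marital"])                     # 3 dummy columns
reduced = pd.get_dummies(df, columns=["marital"], drop_first=True)  # 2 dummy columns
print(full.shape[1], reduced.shape[1])  # -> 3 2
```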

In [74]:
print (df.dtypes) ## This is to make sure there are no object columns left in the dataset (as happened before due to a failure in the cell above)
age                    int64
default                int64
balance                int64
housing                int64
loan                   int64
day                    int64
duration               int64
campaign               int64
pdays                  int64
previous               int64
Target                 int64
marital_divorced       uint8
marital_married        uint8
marital_single         uint8
education_primary      uint8
education_secondary    uint8
education_tertiary     uint8
education_unknown      uint8
contact_cellular       uint8
contact_telephone      uint8
contact_unknown        uint8
poutcome_failure       uint8
poutcome_other         uint8
poutcome_success       uint8
poutcome_unknown       uint8
job_admin.             uint8
job_blue-collar        uint8
job_entrepreneur       uint8
job_housemaid          uint8
job_management         uint8
job_retired            uint8
job_self-employed      uint8
job_services           uint8
job_student            uint8
job_technician         uint8
job_unemployed         uint8
job_unknown            uint8
month_apr              uint8
month_aug              uint8
month_dec              uint8
month_feb              uint8
month_jan              uint8
month_jul              uint8
month_jun              uint8
month_mar              uint8
month_may              uint8
month_nov              uint8
month_oct              uint8
month_sep              uint8
dtype: object
  • ### 1.2 Multivariate analysis (8 marks)
    • ### Correlation method to observe the relationship between different variables
In [75]:
df.corr() # with this raw matrix it is hard to see any correlation.
Out[75]:
age default balance housing loan day duration campaign pdays previous ... month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep
age 1.00 -0.02 0.10 -0.19 -0.02 -0.01 -0.00 0.00 -0.02 0.00 ... 0.02 -0.00 -0.01 0.00 0.05 0.02 -0.13 0.03 0.06 0.03
default -0.02 1.00 -0.07 -0.01 0.08 0.01 -0.01 0.02 -0.03 -0.02 ... -0.01 -0.01 -0.01 0.04 0.01 -0.01 -0.00 0.01 -0.02 -0.01
balance 0.10 -0.07 1.00 -0.07 -0.08 0.00 0.02 -0.01 0.00 0.02 ... 0.02 -0.00 -0.02 -0.06 0.03 0.02 -0.07 0.12 0.04 0.02
housing -0.19 -0.01 -0.07 1.00 0.04 -0.03 0.01 -0.02 0.12 0.04 ... -0.05 -0.06 -0.07 -0.06 -0.10 -0.07 0.43 0.00 -0.09 -0.08
loan -0.02 0.08 -0.08 0.04 1.00 0.01 -0.01 0.01 -0.02 -0.01 ... -0.02 -0.01 -0.00 0.17 -0.02 -0.03 -0.03 0.02 -0.03 -0.03
day -0.01 0.01 0.00 -0.03 0.01 1.00 -0.03 0.16 -0.09 -0.05 ... -0.01 -0.28 0.25 0.15 -0.19 -0.02 -0.03 0.10 0.03 -0.05
duration -0.00 -0.01 0.02 0.01 -0.01 -0.03 1.00 -0.08 -0.00 0.00 ... 0.02 -0.01 0.01 0.02 -0.02 -0.01 0.01 -0.01 0.02 0.02
campaign 0.00 0.02 -0.01 -0.02 0.01 0.16 -0.08 1.00 -0.09 -0.03 ... -0.01 -0.03 -0.06 0.10 0.04 -0.02 -0.07 -0.08 -0.05 -0.04
pdays -0.02 -0.03 0.00 0.12 -0.02 -0.09 -0.00 -0.09 1.00 0.45 ... 0.05 0.07 0.05 -0.14 -0.11 0.03 0.08 0.01 0.06 0.08
previous 0.00 -0.02 0.02 0.04 -0.01 -0.05 0.00 -0.03 0.45 1.00 ... 0.04 0.07 0.05 -0.08 -0.06 0.03 0.00 0.04 0.05 0.06
Target 0.03 -0.02 0.05 -0.14 -0.07 -0.03 0.39 -0.07 0.10 0.09 ... 0.08 0.04 -0.01 -0.03 -0.02 0.13 -0.10 -0.01 0.13 0.12
marital_divorced 0.16 0.02 -0.02 0.00 0.02 -0.00 0.01 -0.02 0.00 -0.00 ... -0.00 -0.00 0.00 0.02 0.01 -0.00 0.01 0.01 -0.00 -0.01
marital_married 0.29 -0.01 0.03 0.02 0.04 0.01 -0.02 0.03 -0.03 -0.01 ... -0.01 -0.03 -0.04 0.02 0.02 -0.02 -0.04 0.02 -0.01 -0.01
marital_single -0.43 0.00 -0.01 -0.02 -0.05 -0.01 0.02 -0.02 0.03 0.02 ... 0.01 0.04 0.04 -0.04 -0.03 0.02 0.03 -0.03 0.01 0.02
education_primary 0.20 0.00 -0.02 0.01 -0.01 -0.02 -0.00 0.01 -0.02 -0.02 ... -0.01 -0.02 -0.02 0.01 0.06 -0.02 0.04 -0.03 -0.01 -0.01
education_secondary -0.09 0.01 -0.07 0.10 0.07 -0.01 0.00 -0.02 0.02 -0.01 ... -0.01 -0.00 -0.00 0.02 -0.02 -0.03 0.08 -0.02 -0.02 -0.03
education_tertiary -0.08 -0.02 0.08 -0.10 -0.05 0.02 0.00 0.01 -0.01 0.02 ... 0.01 0.02 0.01 -0.02 -0.04 0.04 -0.12 0.05 0.03 0.03
education_unknown 0.07 -0.00 0.01 -0.05 -0.05 0.00 -0.00 0.01 -0.01 -0.01 ... 0.01 -0.00 0.01 -0.00 0.03 0.01 -0.00 -0.02 0.01 0.02
contact_cellular -0.07 -0.01 0.02 -0.16 0.01 0.02 0.03 -0.03 0.23 0.13 ... 0.02 0.13 0.10 0.17 -0.39 0.05 -0.36 0.16 0.03 0.04
contact_telephone 0.17 -0.02 0.04 -0.08 -0.01 0.02 -0.02 0.05 0.02 0.03 ... 0.03 0.04 0.02 0.10 -0.07 0.02 -0.08 0.04 0.06 0.02
contact_unknown -0.02 0.02 -0.04 0.21 -0.01 -0.03 -0.01 0.00 -0.25 -0.15 ... -0.04 -0.16 -0.11 -0.23 0.45 -0.06 0.43 -0.19 -0.06 -0.05
poutcome_failure -0.00 -0.03 0.01 0.11 -0.00 -0.07 -0.02 -0.09 0.70 0.35 ... 0.02 0.07 0.06 -0.13 -0.10 0.02 0.03 0.09 0.04 0.04
poutcome_other -0.02 -0.01 0.01 0.04 -0.01 -0.03 -0.00 -0.02 0.39 0.31 ... 0.03 0.07 0.06 -0.07 -0.05 0.02 0.01 0.02 0.03 0.04
poutcome_success 0.04 -0.02 0.04 -0.09 -0.05 -0.03 0.04 -0.06 0.23 0.20 ... 0.08 0.03 0.01 -0.04 -0.02 0.05 -0.06 0.00 0.10 0.12
poutcome_unknown -0.00 0.04 -0.03 -0.06 0.03 0.09 -0.00 0.11 -0.87 -0.53 ... -0.07 -0.11 -0.08 0.16 0.12 -0.05 -0.00 -0.09 -0.09 -0.11
job_admin. -0.06 -0.01 -0.03 0.04 0.03 -0.01 -0.02 -0.02 0.03 0.01 ... -0.00 0.00 0.01 0.02 -0.00 0.01 0.03 -0.01 0.01 0.01
job_blue-collar -0.04 0.01 -0.05 0.18 0.02 -0.02 0.01 0.01 0.02 -0.02 ... -0.03 -0.04 -0.04 -0.01 0.02 -0.04 0.17 -0.05 -0.04 -0.04
job_entrepreneur 0.02 0.03 0.01 0.01 0.04 -0.00 -0.00 0.00 -0.01 -0.01 ... -0.01 -0.00 -0.01 0.03 0.02 -0.02 -0.01 0.05 -0.01 -0.01
job_housemaid 0.09 -0.00 0.00 -0.08 -0.02 0.00 -0.01 0.00 -0.03 -0.02 ... 0.00 -0.01 -0.01 0.03 0.05 -0.00 -0.07 -0.01 0.01 -0.00
job_management -0.02 -0.00 0.07 -0.06 -0.04 0.02 -0.01 0.02 -0.01 0.02 ... 0.00 0.00 -0.00 -0.01 -0.03 0.02 -0.08 0.05 0.01 0.02
job_retired 0.45 -0.01 0.05 -0.16 -0.01 -0.01 0.03 -0.03 -0.01 0.01 ... 0.04 0.02 0.01 -0.00 0.01 0.04 -0.07 -0.02 0.08 0.06
job_self-employed -0.01 0.00 0.02 -0.03 -0.01 0.01 0.01 0.01 -0.01 -0.00 ... -0.00 0.01 0.00 0.00 0.01 -0.00 -0.03 0.04 0.00 -0.01
job_services -0.07 0.00 -0.04 0.07 0.04 -0.01 0.00 -0.00 0.01 -0.01 ... -0.01 -0.01 0.00 0.03 0.01 -0.02 0.05 -0.02 -0.03 -0.02
job_student -0.20 -0.02 0.00 -0.09 -0.06 -0.02 -0.01 -0.02 0.02 0.02 ... 0.03 0.03 0.01 -0.03 -0.01 0.04 -0.01 -0.02 0.03 0.05
job_technician -0.07 -0.00 -0.02 -0.01 0.01 0.03 -0.01 0.02 -0.01 -0.00 ... -0.00 -0.01 0.00 -0.02 -0.04 -0.01 -0.04 -0.01 -0.01 -0.02
job_unemployed 0.00 0.01 0.01 -0.05 -0.04 -0.01 0.02 -0.02 -0.01 -0.01 ... 0.00 0.08 0.05 -0.01 0.00 0.01 -0.04 0.02 0.00 0.01
job_unknown 0.05 -0.01 0.01 -0.08 -0.03 -0.01 -0.01 0.01 -0.02 -0.01 ... -0.00 -0.00 0.01 -0.01 0.05 -0.00 -0.03 -0.01 0.01 0.01
month_apr -0.03 -0.03 0.02 0.08 -0.03 0.05 0.04 -0.07 0.14 0.05 ... -0.02 -0.07 -0.05 -0.11 -0.10 -0.03 -0.17 -0.08 -0.03 -0.03
month_aug 0.07 -0.01 0.01 -0.31 -0.07 0.03 -0.04 0.15 -0.11 -0.05 ... -0.03 -0.10 -0.07 -0.17 -0.15 -0.04 -0.26 -0.12 -0.05 -0.05
month_dec 0.02 -0.01 0.02 -0.05 -0.02 -0.01 0.02 -0.01 0.05 0.04 ... 1.00 -0.02 -0.01 -0.03 -0.03 -0.01 -0.05 -0.02 -0.01 -0.01
month_feb -0.00 -0.01 -0.00 -0.06 -0.01 -0.28 -0.01 -0.03 0.07 0.07 ... -0.02 1.00 -0.04 -0.11 -0.09 -0.03 -0.17 -0.08 -0.03 -0.03
month_jan -0.01 -0.01 -0.02 -0.07 -0.00 0.25 0.01 -0.06 0.05 0.05 ... -0.01 -0.04 1.00 -0.08 -0.07 -0.02 -0.12 -0.06 -0.02 -0.02
month_jul 0.00 0.04 -0.06 -0.06 0.17 0.15 0.02 0.10 -0.14 -0.08 ... -0.03 -0.11 -0.08 1.00 -0.16 -0.04 -0.28 -0.13 -0.05 -0.05
month_jun 0.05 0.01 0.03 -0.10 -0.02 -0.19 -0.02 0.04 -0.11 -0.06 ... -0.03 -0.09 -0.07 -0.16 1.00 -0.04 -0.24 -0.11 -0.05 -0.04
month_mar 0.02 -0.01 0.02 -0.07 -0.03 -0.02 -0.01 -0.02 0.03 0.03 ... -0.01 -0.03 -0.02 -0.04 -0.04 1.00 -0.07 -0.03 -0.01 -0.01
month_may -0.13 -0.00 -0.07 0.43 -0.03 -0.03 0.01 -0.07 0.08 0.00 ... -0.05 -0.17 -0.12 -0.28 -0.24 -0.07 1.00 -0.21 -0.09 -0.08
month_nov 0.03 0.01 0.12 0.00 0.02 0.10 -0.01 -0.08 0.01 0.04 ... -0.02 -0.08 -0.06 -0.13 -0.11 -0.03 -0.21 1.00 -0.04 -0.04
month_oct 0.06 -0.02 0.04 -0.09 -0.03 0.03 0.02 -0.05 0.06 0.05 ... -0.01 -0.03 -0.02 -0.05 -0.05 -0.01 -0.09 -0.04 1.00 -0.01
month_sep 0.03 -0.01 0.02 -0.08 -0.03 -0.05 0.02 -0.04 0.08 0.06 ... -0.01 -0.03 -0.02 -0.05 -0.04 -0.01 -0.08 -0.04 -0.01 1.00

49 rows × 49 columns

In [77]:
#Another correlation methods
plt.figure(figsize=(30,60))
sns.heatmap(df.corr(),
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")
plt.show()
In [ ]:
sns.pairplot(df, hue = 'Target') ## After several trials, this did not finish on this computer

Insights

  • The target variable has a small correlation with the months of September, October, and March. Surprisingly, it has no correlation with May, even though May had the most phone calls. It also has only a minor correlation with cellular contact, even though that was the main means of communication.
  • The target has its strongest correlation with the duration of the phone call. This variable was considered for dropping because of the boxplot and histogram results, but we will keep it to evaluate the results.
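A quicker way to read the Target row of the large matrix above is to sort each feature's correlation with the (numeric) target. A toy illustration:

```python
import pandas as pd

# Toy frame: duration rises with the positive class, campaign falls with it
df = pd.DataFrame({
    "duration": [100, 900, 120, 800],
    "campaign": [5, 1, 6, 2],
    "Target": [0, 1, 0, 1],
})

# One sorted Series instead of scanning a 49x49 matrix
corr_with_target = df.corr()["Target"].drop("Target").sort_values(ascending=False)
print(corr_with_target.index.tolist())  # -> ['duration', 'campaign']
```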
  • ### 2.1 Ensure the attribute types are correct. If not, take appropriate actions.
In [78]:
# Data set was corrected in previous steps
print("This is the new Dtype for the dataset")
df.dtypes
This is the new Dtype for the dataset
Out[78]:
age                    int64
default                int64
balance                int64
housing                int64
loan                   int64
day                    int64
duration               int64
campaign               int64
pdays                  int64
previous               int64
Target                 int64
marital_divorced       uint8
marital_married        uint8
marital_single         uint8
education_primary      uint8
education_secondary    uint8
education_tertiary     uint8
education_unknown      uint8
contact_cellular       uint8
contact_telephone      uint8
contact_unknown        uint8
poutcome_failure       uint8
poutcome_other         uint8
poutcome_success       uint8
poutcome_unknown       uint8
job_admin.             uint8
job_blue-collar        uint8
job_entrepreneur       uint8
job_housemaid          uint8
job_management         uint8
job_retired            uint8
job_self-employed      uint8
job_services           uint8
job_student            uint8
job_technician         uint8
job_unemployed         uint8
job_unknown            uint8
month_apr              uint8
month_aug              uint8
month_dec              uint8
month_feb              uint8
month_jan              uint8
month_jul              uint8
month_jun              uint8
month_mar              uint8
month_may              uint8
month_nov              uint8
month_oct              uint8
month_sep              uint8
dtype: object

Observation

  • As part of preparing the data, the column Target was modified in previous steps: yes = 1 and no = 0
  • One-hot encoding was also applied above in order to prepare the data for the model
  • ### 2.2 Create the training set and test set in ratio of 70:30
In [80]:
# from sklearn.model_selection import train_test_split <<  this is the library that will be used (loaded at the beginning)

X = df.drop('Target',axis=1)     # Predictor feature columns (48 x m)
Y = df['Target']  # target variable (1 x m)
In [81]:
##Split into training and test set
x_train, x_test,y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1) # 1 is just any random seed number
print (x_train.shape, x_test.shape)
x_train.head()
(31647, 48) (13564, 48)
Out[81]:
age default balance housing loan day duration campaign pdays previous ... month_dec month_feb month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep
6149 32 1 -238 1 0 27 427 1 -1 0 ... 0 0 0 0 0 0 1 0 0 0
12403 34 0 -478 1 1 27 111 4 -1 0 ... 0 0 0 0 1 0 0 0 0 0
21645 32 0 266 1 0 19 168 2 -1 0 ... 0 0 0 0 0 0 0 0 0 0
29580 36 1 13 0 1 3 150 4 -1 0 ... 0 1 0 0 0 0 0 0 0 0
31245 23 0 486 0 0 3 87 1 -1 0 ... 0 0 0 0 0 1 0 0 0 0

5 rows × 48 columns

In [12]:
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 31647 entries, 6149 to 33003
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                31647 non-null  int64
 1   marital            31647 non-null  int64
 2   education          31647 non-null  int64
 3   default            31647 non-null  int64
 4   balance            31647 non-null  int64
 5   housing            31647 non-null  int64
 6   loan               31647 non-null  int64
 7   contact            31647 non-null  int64
 8   day                31647 non-null  int64
 9   month              31647 non-null  int64
 10  duration           31647 non-null  int64
 11  campaign           31647 non-null  int64
 12  pdays              31647 non-null  int64
 13  previous           31647 non-null  int64
 14  poutcome           31647 non-null  int64
 15  job_admin.         31647 non-null  uint8
 16  job_blue-collar    31647 non-null  uint8
 17  job_entrepreneur   31647 non-null  uint8
 18  job_housemaid      31647 non-null  uint8
 19  job_management     31647 non-null  uint8
 20  job_retired        31647 non-null  uint8
 21  job_self-employed  31647 non-null  uint8
 22  job_services       31647 non-null  uint8
 23  job_student        31647 non-null  uint8
 24  job_technician     31647 non-null  uint8
 25  job_unemployed     31647 non-null  uint8
 26  job_unknown        31647 non-null  uint8
dtypes: int64(15), uint8(12)
memory usage: 4.2 MB
In [177]:
x_test.head() # this is to review the columns 
Out[177]:
age marital education default balance housing loan contact day month ... job_entrepreneur job_housemaid job_management job_retired job_self-employed job_services job_student job_technician job_unemployed job_unknown
3610 42 1 1 0 2519 1 0 -1 15 5 ... 0 0 0 0 0 0 0 0 0 0
11677 37 1 2 0 2209 0 0 -1 20 6 ... 0 0 0 0 0 0 0 1 0 0
33018 32 1 2 0 923 1 0 1 17 4 ... 0 0 0 0 0 0 0 0 0 0
44323 53 1 1 0 306 0 0 1 28 7 ... 0 0 0 0 0 0 0 0 0 0
8119 32 2 3 0 257 1 0 -1 2 6 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 27 columns

In [82]:
y_test.head() # this is to make sure the split was done properly and that there are no strings in the column.
Out[82]:
3610     0
11677    0
33018    0
44323    1
8119     0
Name: Target, dtype: int64

Checking the split of the data

In [83]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(df.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(df.index)) * 100))
70.00% data is in training set
30.00% data is in test set
  • ### 3.1.a) Logistic Regression model
In [84]:
#The following libraries will be used ( already imported at the beginning)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score

# Fit the model on train
model = LogisticRegression(random_state=1,fit_intercept=False)
model.fit(x_train, y_train)
Out[84]:
LogisticRegression(fit_intercept=False, random_state=1)
In [85]:
y_predict = model.predict(x_test)              # Predicting the target variable on test data
In [86]:
# Observe the predicted and observed classes in a dataframe.

z = x_test.copy()
z['Observed Target'] = y_test
z['Predicted Target'] = y_predict
z.head()
Out[86]:
age default balance housing loan day duration campaign pdays previous ... month_jan month_jul month_jun month_mar month_may month_nov month_oct month_sep Observed Target Predicted Target
3610 42 0 2519 1 0 15 262 4 -1 0 ... 0 0 0 0 1 0 0 0 0 0
11677 37 0 2209 0 0 20 167 2 -1 0 ... 0 0 1 0 0 0 0 0 0 0
33018 32 0 923 1 0 17 819 4 -1 0 ... 0 0 0 0 0 0 0 0 0 0
44323 53 0 306 0 0 28 388 3 181 1 ... 0 1 0 0 0 0 0 0 1 0
8119 32 0 257 1 0 2 183 5 -1 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 50 columns

In [88]:
print("Training accuracy =",model.score(x_train,y_train))
print()
print("Testing accuracy =",model.score(x_test, y_test))
print()
print("Recall = ",recall_score(y_test,y_predict))
print()
print("Precision = ",precision_score(y_test,y_predict))
print()
print("F1 Score =",f1_score(y_test,y_predict))
print()
print("Roc Auc Score =",roc_auc_score(y_test,y_predict))
Training accuracy = 0.891332511770468

Testing accuracy = 0.8912562665880271

Recall =  0.2005157962604771

Precision =  0.5695970695970696

F1 Score = 0.2966142107773009

Roc Auc Score = 0.5904768276232877
In [89]:
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])

df_cm = pd.DataFrame(cm, index = [i for i in ["Observed 1","Observed 0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)
print('Confusion Matrix')
print(model.score(x_test, y_test))
Confusion Matrix
0.8912562665880271
In [90]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logit_roc_auc = roc_auc_score(y_test, model.predict(x_test))
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Insight

The recall for this logistic regression is low (20%) whereas the precision is 56%. These numbers leave room for improvement, and we expect to improve them with ensemble techniques.
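Since recall is the metric we care about here, one lever worth noting (besides switching models) is the decision threshold itself: `predict` uses 0.5 by default, but lowering the cutoff applied to `predict_proba` trades precision for recall. A minimal sketch on synthetic imbalanced data (not the bank dataset; the 0.3 threshold is illustrative, not tuned):

```python
# Sketch: trading precision for recall by lowering the decision threshold.
# Synthetic imbalanced data stands in for the bank dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # P(class = 1) for each test row

for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_te, preds):.2f} "
          f"precision={precision_score(y_te, preds):.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; precision usually pays the price.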

In [91]:
## Feature Importance or Coefficients 
fi = pd.DataFrame()
fi['Col'] = x_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
Out[91]:
Col Coeff
3 housing 0.78
19 contact_unknown 0.49
23 poutcome_unknown 0.49
44 month_may 0.48
22 poutcome_success 0.36
7 campaign 0.34
14 education_secondary 0.31
25 job_blue-collar 0.29
4 loan 0.22
29 job_retired 0.20
11 marital_married 0.19
9 previous 0.18
20 poutcome_failure 0.16
12 marital_single 0.15
46 month_oct 0.12
41 month_jul 0.10
31 job_services 0.10
17 contact_cellular 0.09
43 month_mar 0.09
33 job_technician 0.09
47 month_sep 0.09
13 education_primary 0.08
45 month_nov 0.07
18 contact_telephone 0.06
15 education_tertiary 0.05
42 month_jun 0.05
32 job_student 0.04
21 poutcome_other 0.04
26 job_entrepreneur 0.04
36 month_apr 0.04
39 month_feb 0.03
38 month_dec 0.03
0 age 0.03
1 default 0.03
24 job_admin. 0.03
30 job_self-employed 0.02
37 month_aug 0.02
40 month_jan 0.02
10 marital_divorced 0.02
27 job_housemaid 0.01
16 education_unknown 0.01
5 day 0.01
34 job_unemployed 0.00
35 job_unknown 0.00
28 job_management 0.00
8 pdays 0.00
6 duration 0.00
2 balance 0.00

Insight

All the coefficients are less than 1; housing, contact_unknown and poutcome_unknown are the top 3, followed by month_may. Four of the top five coefficients come from the one-hot encoding, which suggests applying that technique to those variables was a good choice. In the coming steps we can check whether these variables remain relevant in other models.
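A caveat on reading these magnitudes: logistic-regression coefficients are only comparable across features when the features share a scale, and here 0/1 dummies sit next to columns like balance that span thousands, which may explain the near-zero coefficients for balance, duration and pdays. A minimal sketch on synthetic data (not the bank dataset) showing how standardizing first makes magnitudes comparable:

```python
# Sketch: why raw logistic-regression coefficients can mislead when
# features have very different scales. Synthetic data, not the bank dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
x_small = rng.normal(size=n)             # unit-scale feature (like a 0/1 dummy)
x_large = 1000 * rng.normal(size=n)      # large-scale feature (like 'balance')
# Both features contribute equally to the target.
y = (x_small + x_large / 1000 + rng.normal(size=n) > 0).astype(int)
X = np.column_stack([x_small, x_large])

raw = LogisticRegression(max_iter=5000).fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)).fit(X, y)

print("raw coefs:   ", raw.coef_[0])        # large-scale feature gets a tiny coef
print("scaled coefs:", scaled[-1].coef_[0]) # comparable magnitudes
</```
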

  • ### 3.1.b) Build Decision Tree Model

We will build our model using the DecisionTreeClassifier function, with the 'entropy' criterion to split; the default criterion is 'gini'.
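For reference, both impurity measures are easy to compute by hand; a minimal sketch for a node's class-probability vector:

```python
# Sketch: the two DecisionTreeClassifier split criteria, computed on a
# vector of class probabilities at a node.
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy in bits: -sum(p_i * log2(p_i)), with 0*log2(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity under both criteria;
# a 50/50 node maximises both: gini = 0.5, entropy = 1 bit.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 1.0
```

Both criteria usually produce similar trees; entropy penalises mixed nodes slightly more strongly.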

In [105]:
dTree = DecisionTreeClassifier(criterion = 'entropy', random_state=1)
dTree.fit(x_train, y_train)
Out[105]:
DecisionTreeClassifier(criterion='entropy', random_state=1)

Scoring our Decision Tree

In [106]:
print("Train: %.2f" % dTree.score(x_train, y_train))  # performance on train data
print("Test: %.2f" % dTree.score(x_test, y_test))  # performance on test data
Train: 1.00
Test: 0.88

Above we see a drop of 0.12 from the train score to the test score; this result shows the model is overfitting the training data

Visualizing the Decision Tree overfitted

In [107]:
### While making this project, the installs below were tried while looking for the right way to install graphviz

#!pip install Graphviz
# pip install pydotplus
#pip install six
#pip install --upgrade mglearn
#pip install mlrose
In [108]:
from sklearn.tree import export_graphviz
#from sklearn.externals.six import StringIO 
#from six import StringIO
import six
import sys
sys.modules['sklearn.externals.six'] = six
import mlrose
from IPython.display import Image  
import pydotplus
import graphviz
In [109]:
train_char_label = ['No', 'Yes']
Credit_Tree_File = open('credit_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(x_train), class_names = list(train_char_label))
Credit_Tree_File.close()
In [110]:
retCode = system("dot -Tpng credit_tree.dot -o credit_tree.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("credit_tree.png"))

Insight

  • The graph above is readable if you zoom in; it clearly shows the overfitting of the model

Reducing over fitting (Regularization)

In [115]:
clf_pruned = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
                               max_depth=4, min_samples_leaf=5)
clf_pruned.fit(x_train, y_train)
print(clf_pruned.score(x_train, y_train))
print(clf_pruned.score(x_test, y_test))
0.8997693304262647
0.9014302565614863
In [116]:
print("Train: %.2f" % clf_pruned.score(x_train, y_train))  # performance on train data
print("Test: %.2f" % clf_pruned.score(x_test, y_test))  # performance on test data
Train: 0.90
Test: 0.90

Insight

  • The result is a better fit, with good and equal performance between the train and the test data
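max_depth and min_samples_leaf are pre-pruning controls; scikit-learn also supports post-hoc cost-complexity pruning via the ccp_alpha parameter, which could be tried here as well. A minimal sketch on synthetic data (the alpha value is illustrative, not tuned for the bank data):

```python
# Sketch: cost-complexity pruning as an alternative to depth limits.
# Synthetic data; ccp_alpha=0.005 is illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=0.005).fit(X_tr, y_tr)

# Pruning shrinks the tree and narrows the train/test gap.
print("leaves:", full.get_n_leaves(), "->", pruned.get_n_leaves())
print("full   train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

Candidate alpha values can be enumerated with `cost_complexity_pruning_path` and chosen by cross-validation.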

Visualizing the tree after pruning

In [117]:
y_train.value_counts()
Out[117]:
0    27909
1     3738
Name: Target, dtype: int64
In [118]:
# out_file=None makes export_graphviz return the dot source as a string
# (the earlier call wrote to a file and left dot_data set to None).
dot_data = tree.export_graphviz(clf_pruned, out_file=None,
                                filled=True, rounded=True,
                                special_characters=True, feature_names=list(x_train), class_names=['No', 'Yes'])
In [119]:
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('df.png')
Image(graph.create_png())
Out[119]:
In [120]:
## Calculating feature importance

feat_importance = clf_pruned.tree_.compute_feature_importances(normalize=False)


feat_imp_dict = dict(zip(list(x_train), clf_pruned.feature_importances_))
feat_imp = pd.DataFrame.from_dict(feat_imp_dict,orient='index')
feat_imp.sort_values(by=0, ascending=False)
Out[120]:
0
duration 0.62
poutcome_success 0.27
contact_unknown 0.10
housing 0.00
campaign 0.00
age 0.00
month_apr 0.00
job_retired 0.00
job_self-employed 0.00
job_services 0.00
job_student 0.00
job_technician 0.00
job_unemployed 0.00
job_unknown 0.00
month_dec 0.00
month_aug 0.00
job_housemaid 0.00
month_feb 0.00
month_jan 0.00
month_jul 0.00
month_jun 0.00
month_mar 0.00
month_may 0.00
month_nov 0.00
month_oct 0.00
job_management 0.00
job_admin. 0.00
job_entrepreneur 0.00
education_primary 0.00
balance 0.00
loan 0.00
day 0.00
pdays 0.00
previous 0.00
marital_divorced 0.00
marital_married 0.00
marital_single 0.00
education_secondary 0.00
job_blue-collar 0.00
education_tertiary 0.00
education_unknown 0.00
contact_cellular 0.00
contact_telephone 0.00
poutcome_failure 0.00
poutcome_other 0.00
poutcome_unknown 0.00
default 0.00
month_sep 0.00
In [121]:
print (pd.DataFrame(clf_pruned.feature_importances_, columns = ["Imp"], index = x_train.columns))
                     Imp
age                 0.00
default             0.00
balance             0.00
housing             0.00
loan                0.00
day                 0.00
duration            0.62
campaign            0.00
pdays               0.00
previous            0.00
marital_divorced    0.00
marital_married     0.00
marital_single      0.00
education_primary   0.00
education_secondary 0.00
education_tertiary  0.00
education_unknown   0.00
contact_cellular    0.00
contact_telephone   0.00
contact_unknown     0.10
poutcome_failure    0.00
poutcome_other      0.00
poutcome_success    0.27
poutcome_unknown    0.00
job_admin.          0.00
job_blue-collar     0.00
job_entrepreneur    0.00
job_housemaid       0.00
job_management      0.00
job_retired         0.00
job_self-employed   0.00
job_services        0.00
job_student         0.00
job_technician      0.00
job_unemployed      0.00
job_unknown         0.00
month_apr           0.00
month_aug           0.00
month_dec           0.00
month_feb           0.00
month_jan           0.00
month_jul           0.00
month_jun           0.00
month_mar           0.00
month_may           0.00
month_nov           0.00
month_oct           0.00
month_sep           0.00

Insight

  • From the feature importance dataframe we can infer that duration, poutcome_success and contact_unknown are the variables with the most impact on Target

Decision tree performance

In [122]:
preds_train = clf_pruned.predict(x_train)
preds_test = clf_pruned.predict(x_test)

acc_DT = accuracy_score(y_test, preds_test)
In [164]:
print("Training accuracy =",clf_pruned.score(x_train,y_train))
print()
print("Testing accuracy =",clf_pruned.score(x_test, y_test))
print()
print("Recall = ",recall_score(y_test,preds_test))
print()
print("Precision = ",precision_score(y_test,preds_test))
print()
print("F1 Score =",f1_score(y_test,preds_test))
print()
print("Roc Auc Score =",roc_auc_score(y_test,preds_test))
Training accuracy = 0.8997693304262647

Testing accuracy = 0.9014302565614863

Recall =  0.33913604126370084

Precision =  0.6276849642004774

F1 Score = 0.44035161155295105

Roc Auc Score = 0.6565820887247498
In [123]:
# Confusion matrix
pd.crosstab(y_test, preds_test, rownames=['Actual'], colnames=['Predicted'])
#NO =0 and YES = 1
Out[123]:
Predicted 0 1
Actual
0 11701 312
1 1025 526
In [124]:
print(clf_pruned.score(x_test , y_test))
y_predict = clf_pruned.predict(x_test)

cm=metrics.confusion_matrix(y_test, y_predict, labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.9014302565614863
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c92bd083a0>

Insight

  • When the tree is regularised, overfitting is reduced, and test accuracy improves slightly (0.88 → 0.90)
In [133]:
# Creating a function for visualizing classifier results

from yellowbrick.classifier import ClassificationReport, ROCAUC
def visClassifierResults(model_w_parameters):
    viz = ClassificationReport(model_w_parameters)
    viz.fit(x_train, y_train)
    viz.score(x_test, y_test)
    viz.show()

    roc = ROCAUC(model_w_parameters)
    roc.fit(x_train, y_train)
    roc.score(x_test, y_test)
    roc.show()
In [134]:
visClassifierResults(DecisionTreeClassifier(criterion = "entropy", max_depth=4))
In [169]:
#Store the accuracy results for each model in a dataframe for final comparison
resultsDf = pd.DataFrame({'Method':['Decision Tree'], 'accuracy': acc_DT,
                        'Recall': recall_score(y_test,preds_test),
                         'Precision':precision_score(y_test,preds_test),
                         'F1 Score':f1_score(y_test,preds_test),
                         'Roc Auc Score':roc_auc_score(y_test,preds_test)})
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
Out[169]:
Method accuracy Recall Precision F1 Score Roc Auc Score
0 Decision Tree 0.90 0.34 0.63 0.44 0.66

Performance metrics

  • Precision: Fraction of predicted positives that are actually positive: TP / (TP + FP)
  • Recall: Fraction of actual positives that the model correctly identifies: TP / (TP + FN)
  • F1-score: Harmonic mean of precision and recall: 2 × (precision × recall) / (precision + recall)
  • Support: Number of occurrences of each class in y_test
  • Accuracy: Fraction of all observations that were correctly classified by the model
  • Macro avg: Metrics calculated per label and then averaged without weighting; this does not take label imbalance into account
  • Micro avg: Metrics calculated globally by counting the total true positives, false negatives and false positives; the weighted avg instead averages per-label metrics weighted by support
  • AUC Score: Given a random observation that belongs to the class and a random one that does not, the AUC is the probability that the model ranks them correctly
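These definitions can be checked directly against the pruned decision tree's confusion matrix reported above (TN=11701, FP=312, FN=1025, TP=526):

```python
# Verifying the metric definitions against the pruned decision tree's
# confusion matrix reported above (TN=11701, FP=312, FN=1025, TP=526).
tn, fp, fn, tp = 11701, 312, 1025, 526

precision = tp / (tp + fp)            # predicted positives that are right
recall    = tp / (tp + fn)            # actual positives that are found
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 4))
# 0.6277 0.3391 0.4404 0.9014 — matching the scores printed earlier
```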
  • ### 3.2 Build the ensemble models (Bagging and Boosting) and note the model performance using different metrics. Use the same metrics as in the model above. (at least 3 algorithms) (15 marks)

Apply the Random forest model and print the accuracy of Random forest Model

In [136]:
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassificationReport, ROCAUC

rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(x_train, y_train)
In [148]:
pred_RF = rfcl.predict(x_test)
acc_RF = accuracy_score(y_test, pred_RF)
In [143]:
#y_predict = rfcl.predict(x_test)
print(rfcl.score(x_test, y_test))
cm=metrics.confusion_matrix(y_test, pred_RF ,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.907770569153642
Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c938df7ca0>
In [139]:
visClassifierResults(RandomForestClassifier(n_estimators = 50))
In [170]:
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Random Forest'], 'accuracy': [acc_RF],
                        'Recall': recall_score(y_test,pred_RF),
                         'Precision':precision_score(y_test,pred_RF),
                         'F1 Score':f1_score(y_test,pred_RF),
                         'Roc Auc Score':roc_auc_score(y_test,pred_RF)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
Out[170]:
Method accuracy Recall Precision F1 Score Roc Auc Score
0 Decision Tree 0.90 0.34 0.63 0.44 0.66
0 Random Forest 0.91 0.40 0.66 0.49 0.68

Compared to the decision tree, the accuracy has improved by about 1% with the Random Forest model

Apply Adaboost Ensemble Algorithm for the same data and print the accuracy.

In [152]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(n_estimators = 100, learning_rate=0.1, random_state=22)
abcl = abcl.fit(x_train, y_train)
In [153]:
pred_AB =abcl.predict(x_test)
acc_AB = accuracy_score(y_test, pred_AB)
In [171]:
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Adaboost'], 'accuracy': [acc_AB],
                        'Recall': recall_score(y_test,pred_AB),
                         'Precision':precision_score(y_test,pred_AB),
                         'F1 Score':f1_score(y_test,pred_AB),
                         'Roc Auc Score':roc_auc_score(y_test,pred_AB)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
Out[171]:
Method accuracy Recall Precision F1 Score Roc Auc Score
0 Decision Tree 0.90 0.34 0.63 0.44 0.66
0 Random Forest 0.91 0.40 0.66 0.49 0.68
0 Adaboost 0.90 0.21 0.67 0.32 0.60
In [155]:
y_predict = abcl.predict(x_test)
print(abcl.score(x_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.8977440283102329
Out[155]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c939e5d820>
In [156]:
visClassifierResults(AdaBoostClassifier(n_estimators= 100, learning_rate=0.1, random_state=22))

Apply Bagging Classifier Algorithm and print the accuracy.

In [157]:
from sklearn.ensemble import BaggingClassifier

bgcl = BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl = bgcl.fit(x_train, y_train)
In [158]:
pred_BG = bgcl.predict(x_test)
acc_BG = accuracy_score(y_test, pred_BG)
In [172]:
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Bagging'], 'accuracy': [acc_BG],
                        'Recall': recall_score(y_test,pred_BG),
                         'Precision':precision_score(y_test,pred_BG),
                         'F1 Score':f1_score(y_test,pred_BG),
                         'Roc Auc Score':roc_auc_score(y_test,pred_BG)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
Out[172]:
Method accuracy Recall Precision F1 Score Roc Auc Score
0 Decision Tree 0.90 0.34 0.63 0.44 0.66
0 Random Forest 0.91 0.40 0.66 0.49 0.68
0 Adaboost 0.90 0.21 0.67 0.32 0.60
0 Bagging 0.91 0.49 0.62 0.55 0.73
In [174]:
y_predict = bgcl.predict(x_test)
print(bgcl.score(x_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.9071807726334414
Out[174]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c938e12d90>

Insight

  • This heatmap shows the lowest false-negative count so far, 792.
In [160]:
visClassifierResults (BaggingClassifier(n_estimators=50, max_samples= .7, bootstrap=True, oob_score=True, random_state=22))

Apply GradientBoost Classifier Algorithm for the same data and print the accuracy

In [161]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22)
gbcl = gbcl.fit(x_train, y_train)
In [162]:
pred_GB = gbcl.predict(x_test)
acc_GB = accuracy_score(y_test, pred_GB)
In [175]:
y_predict = gbcl.predict(x_test)
print(gbcl.score(x_test , y_test))

cm=metrics.confusion_matrix(y_test, y_predict,labels=[0, 1])

df_cm = pd.DataFrame(cm, index = [i for i in ["No","Yes"]],
                  columns = [i for i in ["No","Yes"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True ,fmt='g')
0.9050427602477146
Out[175]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c93bf66550>
In [246]:
visClassifierResults(GradientBoostingClassifier(n_estimators = 50, learning_rate = 0.1, random_state=22))
  • ### 3.3 Make a DataFrame to compare models and their metrics.
In [173]:
#Store the accuracy results for each model in a dataframe for final comparison
tempResultsDf = pd.DataFrame({'Method':['Gradient Boost'], 'accuracy': [acc_GB],
                        'Recall': recall_score(y_test,pred_GB),
                         'Precision':precision_score(y_test,pred_GB),
                         'F1 Score':f1_score(y_test,pred_GB),
                         'Roc Auc Score':roc_auc_score(y_test,pred_GB)})
resultsDf = pd.concat([resultsDf, tempResultsDf])
resultsDf = resultsDf[['Method', 'accuracy','Recall','Precision','F1 Score','Roc Auc Score']]
resultsDf
Out[173]:
Method accuracy Recall Precision F1 Score Roc Auc Score
0 Decision Tree 0.90 0.34 0.63 0.44 0.66
0 Random Forest 0.91 0.40 0.66 0.49 0.68
0 Adaboost 0.90 0.21 0.67 0.32 0.60
0 Bagging 0.91 0.49 0.62 0.55 0.73
0 Gradient Boost 0.91 0.36 0.65 0.46 0.67

Insight

  • For this dataset the Random Forest and Bagging models give the highest accuracy; moreover, Bagging has the highest recall, as we can see in the table above.

The main objective of this project is to design a model that helps the marketing team identify potential customers who are relatively more likely to subscribe to a term deposit and thus increase their hit ratio. This goal can be interpreted as the bank wanting to increase the number of positive answers once the client is contacted, for which we should identify which parameters will help us be more accurate in selecting the clients to be contacted.

The best model was the Bagging classifier, with a recall of 0.49. Its test performance (AUC) is 73%.

Important Metric
  • As mentioned above, the bank wants to increase the number of people who subscribe to a term deposit, i.e. a lower number of false negatives; if FN is high, the bank loses chances to increase its hit ratio. Hence recall is the important metric in the context of this classification exercise.

  • The precision of Bagging is actually the lowest. If the bank considers it inconvenient, for any reason, to offer the term deposit plan to the wrong person, then precision would play a more important role, and in that case another model could be chosen.

  • In the heatmaps, the false-negative count (predicted negative, observed positive) is around 1000 for all the models except Bagging, which has 792. This is why we obtain better recall with this algorithm for this particular exercise.

  • The ensemble techniques showed better results than the logistic regression, where the recall was about 20%.
  • The variable duration could be excluded (dropped) from the calculation and the models re-run. This variable contains zeros, and it is not clear how these zeros are treated by the model.
  • Likewise with the variable poutcome: it is 82% 'unknown', but it is worth trying to run the model without this variable.
  • Contact communication type proved to be a key player in some models; the marketing team should consider incorporating more communication channels, such as social media and email.
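One concrete way to push recall higher, given the roughly 12% positive rate in the training set (3,738 of 31,647), is class weighting, which most scikit-learn classifiers support via class_weight='balanced'. A minimal sketch on synthetic imbalanced data (not the bank dataset), using logistic regression for speed:

```python
# Sketch: raising recall on imbalanced data with class_weight='balanced'.
# Synthetic data stands in for the ~12%-positive bank dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.88, 0.12], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000,
                              class_weight='balanced').fit(X_tr, y_tr)

print("recall (plain):   ", recall_score(y_te, plain.predict(X_te)))
print("recall (balanced):", recall_score(y_te, balanced.predict(X_te)))
```

The same keyword works for DecisionTreeClassifier and RandomForestClassifier; the usual price is lower precision, which matters given the trade-off discussed above.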

In this exercise one-hot encoding was applied only to the variables "job", "marital", "education", "contact", "poutcome" and "month", whereas all the boolean columns were replaced by 1 and 0. It would be a good exercise to apply one-hot encoding to all the categorical variables and compare the model results. Applying one-hot encoding to all the categoricals would let us evaluate whether one particular category is an important coefficient.
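A minimal sketch of what encoding every categorical at once looks like with pandas.get_dummies, on a toy frame mimicking a few of the bank columns (values are illustrative):

```python
# Sketch: one-hot encoding all listed categoricals in one call.
# Toy data mimicking a few bank columns; values are illustrative.
import pandas as pd

toy = pd.DataFrame({
    'age':     [42, 37, 53],
    'marital': ['married', 'single', 'divorced'],
    'housing': ['yes', 'no', 'yes'],   # boolean-like column, also encodable
})

# Numeric columns pass through; each listed column becomes 0/1 dummy columns.
encoded = pd.get_dummies(toy, columns=['marital', 'housing'])
print(list(encoded.columns))
```

With drop_first=True, get_dummies drops one category per column, which avoids redundant (perfectly collinear) dummies for models like logistic regression.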

The command sns.pairplot(df, hue='Target') was the step that consumed the most computing resources (longest run time) and did not finish every time it was run. In real life, where more columns are available, this step should be reviewed carefully before letting it run.

In [ ]: